# Multimodal Instruction Fine-tuning

## Llama 3.2 11B Vision Radiology Mini
A multimodal model based on the Llama architecture that supports vision and text instructions and is optimized with 4-bit quantization (see the loading sketch below).
Image-to-Text · p4rzvl · 69 downloads · 0 likes

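The 4-bit quantization mentioned above is typically handled through bitsandbytes when a checkpoint is loaded with Transformers. The following is a minimal sketch under that assumption; the repo id is a placeholder rather than the model's confirmed path, and the exact model class may differ from the one used here.

```python
# Minimal sketch: loading a vision-language checkpoint in 4-bit with bitsandbytes.
# The repo id below is a hypothetical placeholder, not a confirmed path for this model.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

MODEL_ID = "your-org/llama-3.2-11b-vision-radiology-mini"  # hypothetical repo id

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bf16 for stability
)

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForVision2Seq.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,
    device_map="auto",  # requires the `accelerate` package
)
```
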
## Smolvlm2 2.2B Instruct 4bit
SmolVLM2-2.2B-Instruct-4bit is an MLX-format conversion of the SmolVLM2 vision-language model, focused on video-text-to-text tasks.
Apache-2.0 · Image-to-Text · Transformers · English · smdesai · 24 downloads · 1 like

## Kowen Vol 1 Base 7B
A Korean vision-language model based on Qwen2-VL-7B-Instruct, supporting image-to-text tasks.
Apache-2.0 · Image-to-Text · Transformers · Korean · Gwonee · 22 downloads · 1 like

## Med CXRGen I
Med-CXRGen-I is a multimodal large language model fine-tuned from LLaVA-v1.5-7B, specializing in generating the impression section of radiology reports from chest X-ray images.
Apache-2.0 · Image-to-Text · Transformers · X-iZhang · 86 downloads · 1 like

## Med CXRGen F
Med-CXRGen-F is a multimodal large language model fine-tuned from LLaVA-v1.5-7B, designed for radiology report generation, particularly the automatic generation of findings from chest X-ray examinations.
Apache-2.0 · Image-to-Text · Transformers · X-iZhang · 86 downloads · 1 like

## Qwen2 VL 7B SafeRLHF
A multimodal large language model fine-tuned from Qwen2-VL-7B-Instruct on the SafeRLHF dataset, focusing on visual question answering with an emphasis on safety (see the inference sketch below).
Apache-2.0 · Image-to-Text · Safetensors · English · Foreshhh · 1,630 downloads · 2 likes

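Because this checkpoint is described as a Qwen2-VL-7B-Instruct fine-tune, inference presumably follows the standard Qwen2-VL flow in Transformers. The sketch below uses the base model id as a stand-in; the fine-tuned repo id, image path, and question are assumptions, not taken from the model card.

```python
# Minimal Qwen2-VL-style VQA sketch (assumes transformers >= 4.45 with Qwen2-VL support).
# The repo id, image path, and question are placeholders; swap in the fine-tune's actual id.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"  # base model used as a stand-in here
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat prompt that contains one image slot followed by a text question.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Does this image contain anything unsafe?"},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

image = Image.open("example.jpg")  # placeholder image path
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the model's answer is decoded.
answer = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
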
## Xgen Mm Phi3 Mini Instruct Dpo R V1.5
xGen-MM is a series of multimodal foundation models from Salesforce AI Research that builds on the BLIP series and is trained on high-quality image captions and interleaved image-text data.
Apache-2.0 · Image-to-Text · Safetensors · English · Salesforce · 305 downloads · 18 likes

## Chartgemma
ChartGemma is a chart understanding and reasoning model built on PaliGemma. Trained with visual instruction fine-tuning, it processes chart images directly and captures visual trends and underlying information (see the chart-QA sketch below).
MIT · Image-to-Text · Transformers · English · ahmed-masry · 1,243 downloads · 41 likes

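Since ChartGemma is built on PaliGemma, a reasonable assumption is that it loads through the PaliGemma classes in Transformers. The sketch below reflects that assumption; the repo id, chart image, and question are illustrative, and the exact class and prompt format should be checked against the model card.

```python
# Hedged sketch of chart question answering with a PaliGemma-based checkpoint.
# The repo id is assumed from the listing; the image path and question are placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "ahmed-masry/chartgemma"  # assumed repo id for the ChartGemma entry
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

chart = Image.open("chart.png").convert("RGB")        # placeholder chart image
question = "Which year has the highest revenue?"      # placeholder question

inputs = processor(text=question, images=chart, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt and image tokens.
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
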
## Xgen Mm Phi3 Mini Instruct R V1
xGen-MM is the latest series of large multimodal foundation models from Salesforce AI Research, building on the BLIP series with strong image understanding and text generation capabilities.
Image-to-Text · Transformers · English · Salesforce · 804 downloads · 186 likes

## Llava Med 7b Delta
LLaVA-Med is a biomedical multimodal model built through visual instruction fine-tuning, capable of processing biomedical images and text.
Other · Image-to-Text · Transformers · microsoft · 257 downloads · 67 likes

## OTTER MPT7B Init
OTTER-MPT7B-Init is a set of weights for initializing Otter model training, converted directly from OpenFlamingo.
MIT · Image-to-Text · Transformers · luodian · 53 downloads · 3 likes

## Blip Image Captioning
An image captioning model based on the BLIP architecture that generates concise textual descriptions for input images (see the captioning sketch below).
Image-to-Text · Transformers · nnpy · 17 downloads · 6 likes

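For a BLIP-architecture captioner, the standard Transformers BLIP classes apply. The sketch below uses the public Salesforce BLIP base checkpoint as a stand-in; the listed model is assumed, not confirmed, to expose the same interface.

```python
# Minimal BLIP image-captioning sketch using the Transformers BLIP classes.
# The checkpoint id and image path are stand-ins, not taken from the listing above.
from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

model_id = "Salesforce/blip-image-captioning-base"  # stand-in checkpoint with the same architecture
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
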